Session 7: Scraping Static Web Pages

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-07-30

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, we trap some docile data that wants to be found. We will:

  • Go over some parsing examples:
    • Wikipedia: World Happiness Report
  • Discuss some examples of good approaches to data wrangling
  • Go into a bit more detail on requesting raw data


Example: World Happiness Report

Use your Browser to Scout

Use your Browser’s Inspect tool

Note: Might not be available on all browsers; use Chromium-based or Firefox.

Use rvest to scrape

library(rvest)
library(tidyverse)

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=World_Happiness_Report&oldid=1165407285")

# 2. Parse
happy_table <- html |> 
  html_elements(".wikitable") |> # select the right element
  html_table() |>                # special function for tables
  pluck(3)                       # select the third table

# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
            <int> <chr>               <dbl>            <dbl>            <dbl>
 1              1 Finland              7.81             1.28             1.5 
 2              2 Denmark              7.65             1.33             1.50
 3              3 Switzerland          7.56             1.39             1.47
 4              4 Iceland              7.50             1.33             1.55
 5              5 Norway               7.49             1.42             1.50
 6              6 Netherlands          7.45             1.34             1.46
 7              7 Sweden               7.35             1.32             1.43
 8              8 New Zealand          7.3              1.24             1.49
 9              9 Austria              7.29             1.32             1.44
10             10 Luxembourg           7.24             1.54             1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>
## Plot the relationship between wealth and life expectancy
ggplot(happy_table, aes(x = `GDP per capita`, y = `Healthy life expectancy`)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Exercises 1

  1. Get the table with 2023 opinion polling for the next United Kingdom general election from https://en.wikipedia.org/wiki/Opinion_polling_for_the_2024_United_Kingdom_general_election
  2. Wrangle and plot the opinion poll data

Example: UK prime ministers on Wikipedia

Use your Browser to Scout

Use rvest to scrape

# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # using an older version of the page, since it was recently changed

# 2. Parse
pm_table <- html |> 
  html_element(".wikitable:contains('List of prime ministers')") |>
  html_table() |> 
  as_tibble(.name_repair = "unique") |> 
  filter(!duplicated(`Prime ministerOffice(Lifespan)`))

# 3. No further wrangling necessary
pm_table
# A tibble: 75 × 11
   Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
   <chr>        <chr>        <chr>                          <chr>               
 1 "Portrait"   "Portrait"   Prime ministerOffice(Lifespan) start               
 2 "​"           ""           Robert Walpole[27]MP for King… 3 April1721         
 3 "​"           ""           Spencer Compton[28]1st Earl o… 16 February1742     
 4 "​"           ""           Henry Pelham[29]MP for Sussex… 27 August1743       
 5 "​"           ""           Thomas Pelham-Holles[30]1st D… 16 March1754        
 6 "​"           ""           William Cavendish[31]4th Duke… 16 November1756     
 7 "​"           ""           Thomas Pelham-Holles[32]1st D… 29 June1757         
 8 ""           ""           John Stuart[33]3rd Earl of Bu… 26 May1762          
 9 ""           ""           George Grenville[34]MP for Bu… 16 April1763        
10 ""           ""           Charles Watson-Wentworth[35]2… 13 July1765         
# ℹ 65 more rows
# ℹ abbreviated name: ¹​`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
#   `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
#   Party <chr>, Government <chr>, MonarchReign <chr>
<td rowspan="4">
  <span class="anchor" id="18th_century"></span>
   <b>
     <a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
   </b>
   <sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
     <a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
   </sup>
   <br>
   <span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
   <br>(1676–1745)
  </span>
</td>
links <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_attr("href")
title <- html |> 
  html_elements(".wikitable:contains('List of prime ministers') b a") |>
  html_text()
tibble(name = title, link = links)
# A tibble: 90 × 2
   name                 link                                             
   <chr>                <chr>                                            
 1 Robert Walpole       /wiki/Robert_Walpole                             
 2 George I             /wiki/George_I_of_Great_Britain                  
 3 George II            /wiki/George_II_of_Great_Britain                 
 4 Spencer Compton      /wiki/Spencer_Compton,_1st_Earl_of_Wilmington    
 5 Henry Pelham         /wiki/Henry_Pelham                               
 6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 7 William Cavendish    /wiki/William_Cavendish,_4th_Duke_of_Devonshire  
 8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
 9 George III           /wiki/George_III                                 
10 John Stuart          /wiki/John_Stuart,_3rd_Earl_of_Bute              
# ℹ 80 more rows

Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work
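One way to resolve them is a sketch using xml2, which rvest builds on (the example links are taken from the output above):

```r
library(xml2)

links <- c("/wiki/Robert_Walpole", "/wiki/Henry_Pelham")

# url_absolute() resolves relative links against a base URL
url_absolute(links, "https://en.wikipedia.org/")
#> [1] "https://en.wikipedia.org/wiki/Robert_Walpole"
#> [2] "https://en.wikipedia.org/wiki/Henry_Pelham"
```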

Exercises 2

  1. For extracting text, rvest has two functions: html_text and html_text2. Explain the difference. You can test your explanation with the example html below.
html <- "<p>This is some text
         some more text</p><p>A new paragraph!</p>
         <p>Quick Question, is web scraping:

         a) fun
         b) tedious
         c) I'm not sure yet!</p>" |> 
  read_html()
  2. How could you convert the links object so that it contains actual URLs?
  3. How could you add the links we extracted above to the pm_table to keep everything together?

Example: Getting content from embedded json

html <- read_html("https://news.sky.com/story/crowdstrike-company-that-caused-global-techno-meltdown-offers-partners-10-vouchers-to-say-sorry-and-they-dont-work-13184488")

data <- html %>%
  rvest::html_element("[type=\"application/ld+json\"]") %>%
  rvest::html_text() %>%
  jsonlite::fromJSON()

datetime <- data$datePublished %>%
  lubridate::as_datetime()

# headline
headline <- data$headline

# author
author <- data$author$name

text <- html %>%
  rvest::html_elements(".sdc-article-body p") %>%
  rvest::html_text2() %>%
  paste(collapse = "\n")
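If you need this for many articles, the steps above can be wrapped into a single helper. A sketch (parse_article is a hypothetical name; it assumes every target page embeds the same application/ld+json block and uses the same site-specific body selector):

```r
library(rvest)
library(tibble)

parse_article <- function(url) {
  html <- read_html(url)
  # metadata travels in the embedded JSON-LD block
  data <- html |>
    html_element("[type=\"application/ld+json\"]") |>
    html_text() |>
    jsonlite::fromJSON()
  tibble(
    datetime = lubridate::as_datetime(data$datePublished),
    headline = data$headline,
    author   = paste(data$author$name, collapse = "; "),
    # the body selector is specific to this site and will differ elsewhere
    text = html |>
      html_elements(".sdc-article-body p") |>
      html_text2() |>
      paste(collapse = "\n")
  )
}
```

Returning a one-row tibble makes it easy to map the helper over a vector of URLs and bind the results into one table.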

Exercises 3

  1. Get the author, publication datetime, headline and text from this site: https://www.cnet.com/tech/services-and-software/facebook-hopes-to-normalize-idea-of-data-scraping-leaks-says-leaked-internal-memo/ (hint: it works in a very similar way, but you have to apply one extra data wrangling step)

Example: zeit.de

Special Requests: Behind Paywall

Let’s get this cool data journalism article.

html <- read_html("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende")
html |> 
  html_elements(".article-body p") |> 
  html_text2()
[1] "Ganz Deutschland fährt Bahn. So fühlte sich das im Sommer 2022 zumindest an, als das 9-Euro-Ticket für drei Monate für überfüllte Züge sorgte. Die Bundesregierung und viele Menschen zeigten sich begeistert: So leicht war es also, Bürgerinnen und Bürger für die umweltfreundlichen öffentlichen Verkehrsmittel zu begeistern, man muss nur ein günstiges Ticket für ganz Deutschland anbieten."
[2] "Aber als die Bundesregierung den Nachfolger vorstellte, waren viele enttäuscht. 49 Euro monatlich kostet das Deutschlandticket und ist nur im Abo erhältlich. Euphorisch war nur noch die Bundesregierung. Doch jetzt, ein Jahr nach dem Start, kann man sagen: zu Recht. Zumindest, was die Fahrgastzahlen angeht."                                                                                

🤔 Wait, that’s only the first two paragraphs!

💡 Websites use cookies to remember users (including logged in ones)

What are browser cookies

  • Small pieces of data stored on the user’s device by the web browser while browsing websites
  • Purpose:
    • Session Management: Maintain user sessions by storing login information and keeping users logged in as they navigate a website.
    • Personalization: Save user preferences, such as language settings or theme choices, to enhance user experience.
    • Tracking and Analytics: Track user behavior across websites for analytics and targeted advertising.
  • We can use them in scraping:
    • to get content from websites that require consent before giving access
    • to authenticate as a user with content access privileges
    • to access personalized content
    • to simulate real user behavior, reducing the chances of getting blocked by websites with anti-scraping measures
  • You can use browser extensions like “Get cookies.txt” for Chromium-based browsers or “cookies.txt” for Firefox to save your cookies to a file
  • Implications:
    • You need to keep cookies secure as they can authenticate others as you!
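Under the hood, using a cookie is just a matter of sending a Cookie header along with the request. A minimal sketch with httr2 (example.com and the session_id=abc123 pair are placeholders, not a real session):

```r
library(httr2)

# a cookie is a name=value pair sent in the Cookie request header
resp <- request("https://example.com") |>
  req_headers(Cookie = "session_id=abc123") |>
  req_perform()

resp_status(resp) # 200 if the request succeeded
```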

Special Requests: Behind Paywall Cookies!

library(cookiemonster)
library(httr2)
add_cookies("cookies.txt")
html <- request("https://www.zeit.de/mobilitaet/2024-04/deutschlandticket-klimaschutz-oeffentliche-verkehrsmittel-autos-verkehrswende") |> # start a request
  req_options(cookie = get_cookies("zeit.de", as = "string")) |> # add cookies to be sent with it
  req_perform() |> 
  resp_body_html() # extract html from response

html |> 
  html_elements(".article-body p") |> 
  html_text2()

Example: South African Parliament (a special case)

library(httr2)
html <- request("https://web.archive.org/web/20240519142346/https://www.parliament.gov.za/register-members-Interests") |> 
  req_timeout(100) |> 
  req_perform() |> 
  resp_body_html()

links <- html |> 
  html_elements(".parly-h2+ul a") |> 
  html_attr("href")

years <- html |> 
  html_elements(".parly-h2+ul a") |> 
  html_text()

dir.create("data/za", showWarnings = FALSE)
interest_pdfs <- tibble(
  link = links, year = years
) |> 
  mutate(file_name = paste0("data/za/", year, ".pdf"))

if (!file.exists("data/za/2018.pdf")) {
  curl::multi_download(
    urls = interest_pdfs$link, 
    destfiles = interest_pdfs$file_name
  )
}

What do we want

  • REGISTER OF MEMBERS’ INTERESTS

Scraping data from PDFs?

  • Data inside a PDF is actually not such an uncommon case
  • Many institutions share PDFs with tables, images and lists of data
  • We can use some of our new pattern-finding skills to scrape data from these PDFs as well:
    • Politician names seem to be set in a larger, bold font
    • Item headers use the same bold font, but at a smaller size
    • The table content is set in a plain font

Let’s investigate the PDF a little

library(pdftools)
comptext <- pdf_data("data/za/2018.pdf", font_info = TRUE)
comptext[[2]]
# A tibble: 172 × 8
   width height     x     y space text              font_name    font_size
   <int>  <int> <int> <int> <lgl> <chr>             <chr>            <dbl>
 1    78      9    35    65 TRUE  Abraham-Ntantiso, Arial-BoldMT      8.78
 2    31      9   115    65 TRUE  Phoebe            Arial-BoldMT      8.78
 3    24      9   150    65 FALSE (ANC)             Arial-BoldMT      8.78
 4     6      8    41    83 TRUE  1.                Arial-BoldMT      7.50
 5    31      8    54    83 TRUE  SHARES            Arial-BoldMT      7.50
 6    16      8    87    83 TRUE  AND               Arial-BoldMT      7.50
 7    26      8   106    83 TRUE  OTHER             Arial-BoldMT      7.50
 8    40      8   134    83 TRUE  FINANCIAL         Arial-BoldMT      7.50
 9    42      8   177    83 FALSE INTERESTS         Arial-BoldMT      7.50
10    25      8    54    91 TRUE  Nothing           ArialMT           7.50
# ℹ 162 more rows

We see here that:

  • each page is an element in a list
  • each word is in one row of the table
  • it contains the font_size and font_name
  • the position of each word on the page is given with x and y coordinates
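These coordinates are enough to rebuild reading order. A sketch with a few words taken from the output above (sort top-to-bottom by y, then left-to-right by x, and glue words with the same y into one line):

```r
library(dplyr)

words <- tibble::tribble(
  ~x,  ~y, ~text,
   35, 65, "Abraham-Ntantiso,",
  115, 65, "Phoebe",
  150, 65, "(ANC)",
   41, 83, "1.",
   54, 83, "SHARES"
)

words |>
  arrange(y, x) |>
  group_by(y) |>
  summarise(line = paste(text, collapse = " "), .groups = "drop") |>
  pull(line)
#> [1] "Abraham-Ntantiso, Phoebe (ANC)" "1. SHARES"
```

On real pages, y values can jitter by a pixel or two, so you may need to round or bucket them before grouping.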

Let’s investigate a few words we saw above:

# a politician name
comptext[[2]] |> 
  filter(str_detect(text, "Abrahams,"))
# A tibble: 1 × 8
  width height     x     y space text      font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr>     <chr>            <dbl>
1    45      9    35   473 TRUE  Abrahams, Arial-BoldMT      8.78
# an item header
comptext[[2]] |> 
  filter(str_detect(text, "1"))
# A tibble: 7 × 8
  width height     x     y space text  font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr> <chr>            <dbl>
1     6      8    41    83 TRUE  1.    Arial-BoldMT      7.50
2    10      8    39   350 TRUE  10.   Arial-BoldMT      7.50
3    10      8    40   376 TRUE  11.   Arial-BoldMT      7.50
4    10      8    39   403 TRUE  12.   Arial-BoldMT      7.50
5    10      8    39   429 TRUE  13.   Arial-BoldMT      7.50
6     6      8    41   492 TRUE  1.    Arial-BoldMT      7.50
7    10      8    39   749 TRUE  10.   Arial-BoldMT      7.50
# the word "disclose"
comptext[[2]] |> 
  filter(str_detect(text, "disclose"))
# A tibble: 19 × 8
   width height     x     y space text      font_name font_size
   <int>  <int> <int> <int> <lgl> <chr>     <chr>         <dbl>
 1    29      8    90    91 FALSE disclose. ArialMT        7.50
 2    29      8    90   118 FALSE disclose. ArialMT        7.50
 3    29      8    90   144 FALSE disclose. ArialMT        7.50
 4    29      8    90   170 FALSE disclose. ArialMT        7.50
 5    29      8    90   196 FALSE disclose. ArialMT        7.50
 6    29      8    90   268 FALSE disclose. ArialMT        7.50
 7    29      8    90   295 FALSE disclose. ArialMT        7.50
 8    29      8    90   358 FALSE disclose. ArialMT        7.50
 9    29      8    90   385 FALSE disclose. ArialMT        7.50
10    29      8    90   411 FALSE disclose. ArialMT        7.50
11    29      8    90   437 FALSE disclose. ArialMT        7.50
12    29      8    90   500 FALSE disclose. ArialMT        7.50
13    29      8    90   526 FALSE disclose. ArialMT        7.50
14    29      8    90   553 FALSE disclose. ArialMT        7.50
15    29      8    90   579 FALSE disclose. ArialMT        7.50
16    29      8    90   605 FALSE disclose. ArialMT        7.50
17    29      8    90   631 FALSE disclose. ArialMT        7.50
18    29      8    90   658 FALSE disclose. ArialMT        7.50
19    29      8    90   684 FALSE disclose. ArialMT        7.50
# a word inside table
comptext[[2]] |> 
  filter(str_detect(text, "Pringle"))
# A tibble: 1 × 8
  width height     x     y space text    font_name font_size
  <int>  <int> <int> <int> <lgl> <chr>   <chr>         <dbl>
1    23      8   188   721 TRUE  Pringle ArialMT        7.50
# a table header
comptext[[2]] |> 
  filter(str_detect(text, "Description"))
# A tibble: 3 × 8
  width height     x     y space text        font_name    font_size
  <int>  <int> <int> <int> <lgl> <chr>       <chr>            <dbl>
1    41      8    56   223 FALSE Description Arial-BoldMT      7.50
2    41      8    56   322 FALSE Description Arial-BoldMT      7.50
3    41      8    56   711 FALSE Description Arial-BoldMT      7.50
  • It looks like we can say relatively easily where a new politician entry starts based on the font
  • The item header has the same font name, but a different size
  • We can tell quite easily on which items there is nothing to disclose
  • The table colnames are similar to item headers, but start at a different x location
p1 <- comptext[[2]]
p1 |> 
  filter(font_name == "Arial-BoldMT", round(font_size, 1) == 7.5,
         x < 56) |> View()
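Building on these observations, a possible next step is to split the page into one row per politician. This is only a sketch and assumes the patterns above (names in bold at 8.78 pt) hold throughout the document:

```r
library(dplyr)

entries <- comptext[[2]] |>
  arrange(y, x) |>
  mutate(
    # politician names are the only words in bold at 8.78 pt
    is_name  = font_name == "Arial-BoldMT" & round(font_size, 2) == 8.78,
    # start a new entry whenever a run of name words begins
    entry_id = cumsum(is_name & !lag(is_name, default = FALSE))
  ) |>
  group_by(entry_id) |>
  summarise(
    politician = paste(text[is_name], collapse = " "),
    content    = paste(text[!is_name], collapse = " ")
  )
```

Words appearing before the first name on a page (e.g., headers) end up in entry_id 0 and would need separate handling.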

Exercises 4

  1. In the folder /data (relative to this document) there is a PDF with some text. Read it into R
  2. The PDF has two columns, bring the text in the right order as a human would read it
  3. Let’s assume you wanted to have this text in a table with one column indicating the section and one containing the text of the section. How would you do that?
  4. Now let’s assume you wanted to parse this on the paragraph level instead. How would you adapt your approach?

Optional Homework

You have seen some tools and tricks to scrape websites now. But your best ally in web scraping is experience! Until tomorrow noon, your task is to find a page on Wikipedia you find interesting and scrape content from there. Even if you don’t fully succeed, document the steps you take and note down where the information can be found. If you want to try to get some data you actually need from a different website, you’re also welcome to do that. But note that if you collect raw html in R and the data is not where it should be (e.g., the html elements containing the data do not exist), you might have discovered a more advanced site, which we will cover later. Note that down and try another page.

Deadline: Friday before class

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pdftools_3.4.0     httr2_1.0.1        rvest_1.0.4        lubridate_1.9.3   
 [5] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
 [9] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
[13] tidyverse_2.0.0    tinytable_0.3.0.10

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    xml2_1.3.6       
 [5] lattice_0.22-6    stringi_1.8.4     hms_1.1.3         digest_0.6.35    
 [9] magrittr_2.0.3    evaluate_0.23     grid_4.4.1        timechange_0.3.0 
[13] fastmap_1.1.1     Matrix_1.7-0      jsonlite_1.8.8    processx_3.8.4   
[17] chromote_0.2.0    ps_1.7.7          promises_1.3.0    mgcv_1.9-1       
[21] httr_1.4.7        selectr_0.4-2     fansi_1.0.6       scales_1.3.0     
[25] cli_3.6.3         rlang_1.1.4       splines_4.4.1     munsell_0.5.1    
[29] withr_3.0.0       yaml_2.3.8        tools_4.4.1       tzdb_0.4.0       
[33] colorspace_2.1-0  curl_5.2.1        vctrs_0.6.5       R6_2.5.1         
[37] lifecycle_1.0.4   pkgconfig_2.0.3   pillar_1.9.0      later_1.3.2      
[41] gtable_0.3.5      glue_1.7.0        Rcpp_1.0.12       xfun_0.44        
[45] tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.46        farver_2.1.2     
[49] nlme_3.1-164      htmltools_0.5.8.1 websocket_1.4.1   labeling_0.4.3   
[53] rmarkdown_2.26    qpdf_1.3.3        compiler_4.4.1    askpass_1.2.0